## CALIFORNIA POLYTECHNIC STATE UNIVERSITY

Computer Engineering Program

CPE 315 Computer Architecture Sample Final

## Final Exam

Name:

| Question | Points | Score |
|----------|--------|-------|
| 1        | 25     |       |
| 2        | 15     |       |
| 3        | 25     |       |
| 4        | 20     |       |
| 5        | 25     |       |
| 6        | 20     |       |
| Total:   | 130    |       |

1. **Performance** You are to improve performance on an existing system based on a single-chip microprocessor having the following parameters.

| Instruction Type | % Dyn. Instructions | Cycles |
|------------------|---------------------|--------|
| Int              | 30                  | 1      |
| FP               | 40                  | 4      |
| Memory           | 20                  | 5      |
| Branch           | 10                  | 3      |

(a) (5 points) What is the average CPI?

(b) (8 points) By using an optimizing compiler, the number of FP instructions is reduced by 25% and Memory instructions by 50%. What is the new average CPI when using the optimizing compiler? Note the instruction mix is changed!

(c) (5 points) How much speedup do you expect from using the optimizing compiler in Part (b) compared to the unoptimized version in Part (a)?

(d) (7 points) If the Memory unit can be made twice as fast, what is the speedup of this program (the clock speed is unchanged) compared to the original (non-optimized, Part (a)) version of the code? *Hint: Use Amdahl's Law, and*  $F \neq 20\%$ .

2. (15 points) **ARM ISA and Register Conventions** Write the equivalent ARM Thumb code for the C function below. You must follow correct ARM register usage convention. You should not make any assumptions about the functions func1 and func2 other than they also follow correct register usage convention. Your solution may only use registers \$r0-\$r7, \$1r and \$pc, and should be less than 25 instructions. Comment your code to receive full credit. Specifically, describe what each register you use represents.

For the reorder function, parameters a and b are the start addresses of arrays, each with size integers. Note that even though this function doesn't return a value, modifications to the values in a and b will be seen by the calling function.

```
int func1 (int a, int b);
int func2 (int a, int b);

void reorder (int *a, int *b, int size)
{
   for (int i=0; i < size; i++) {
      int num1 = a[i], num2 = b[i];
      a[i] = func1(num1, num2);
      b[i] = func2(num1, num2);
   }
}</pre>
```

## 3. Caches

- (a) For a 512KB 16-way set associative cache that uses a 32-bit address and 32 bytes per line. Calculate the following:
  - i. (2 points) How many bits are used for the byte offset?
  - ii. (5 points) How many bits are used for the index field?
  - iii. (2 points) How many bits are used for the tag?

(b) (16 points) Consider a 64-byte, direct-mapped cache with an 8-byte block size. For the following access pattern of byte addresses, show the final contents of the tag fields in the cache, and determine if each access is a hit or a miss.

| Address:     | 0x24 | 0x32 | 0x18 | 0x08 | 0x10 | 0x11 | 0x12 | 0x20 |
|--------------|------|------|------|------|------|------|------|------|
| Hit or Miss? |      |      |      |      |      |      |      |      |

| Index | Tag |
|-------|-----|
| 000   |     |
| 001   |     |
| 010   |     |
| 011   |     |
| 100   |     |
| 101   |     |
| 110   |     |
| 111   |     |

- 4. **AMAT** Show all your work.
  - (a) (10 points) Compute the AMAT of the following cache system:
    - L1 Has an access time of 3 cycles, and hits 92% of the time.
    - L2 Has an access time of 15 cycles, and hits 97% of the time.
    - ${\bf L3}$  Has an access time of 30 cycles, and hits 99.8% of the time.
    - Main Memory has an access time of 200 cycles.

The access time is required for all accesses, hit or miss.

(b) (10 points) If an I-cache has a 95% hit rate, a hit time of 3 cycles, and a D-cache has a 85% hit rate and a hit time of 3 cycles, and both caches have a miss penalty of 40 cycles, what is average memory access time for a program with 25% memory access instructions (e.g., loads, stores, pushes, pops, etc.)?

## 5. Pipelines

(a) For this part, you'll be examining the following ARM code. For all these questions, you may ignore the fill-cost for the pipeline.

```
r2 \;,\;\; r2 \;,\;\; \#1
                add
    loop:
                ldr
                           r0, [r3]
                \mathbf{sub}
                           r0, r0, \#2
                \mathbf{sub}
                           r1, r3, #4
                ldr
                           r1, [r1]
                lsl
                           r1, r1, #1
                add
                           r1, r0, r1
                           r1\;,\;\;[\;r3\;]
                str
                add
                           r3, r3, #4
                           r2, r4
                cmp
10
                \mathbf{blt}
                           loop
11
```

i. (6 points) If this loop is executed exactly 10 times, what is the CPI for the loop? Assume the branch is dynamically predicted to be taken in all iterations, and that a mispredicted branch takes 6 cycles.

ii. (4 points) Repeat the above calculation if this loop is executed exactly 100 times.

iii. (15 points) Modify the loop only by reordering instructions to minimize CPI. Compute what the minimized CPI is for 10 iterations.

- 6. Virtual Memory In class we learned about how to implement a hierarchical page table structure to allow all page translation and mappings for 32-bit addresses to also be contained in physical pages. However, the lab workstations are 64-bit systems, and use a 48-bit virtual address (the most significant 16 bits are not used for memory addressing), with 4 KByte pages. With a 64-bit system like the lab machines, each entry in a page table or page directory will be 64-bits, or 8 bytes.
  - (a) (16 points) Show how to implement a hierarchical page translation system with a 48-bit address such that all page translations and mappings fit exactly into 4 KByte pages (e.g., all tables, directories, etc. occupy exactly one page).

Show how/where the 48-bit address is divided, and **draw a basic diagram** that illustrates how the hierarchical system works. You do not need to include any TLBs in this diagram.

| 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 | 1109 8 7 6 5 4 3 2 1 0 |
|-------------------------------------------------------------------------------------------------------------|------------------------|
|                                                                                                             | Page Offset - 12 Bits  |

<sup>(</sup>b) (4 points) In the worst case, how many accesses to physical memory must be made for each memory access?